Add MPS support with fused kernels #76
@andrewkchan Thank you for your fantastic work in making the MPS backend finally work! I did a quick test of this PR and found it works almost flawlessly on my laptop. My test setup is a customized M2 Max (30 of the 38 GPU cores). CMake and Xcode build:
CMake and make build:
This PR may be the first to make it possible to run 3DGS workloads natively on Apple silicon. I'm super excited about this new feature and expect it to have an impact in this area soon 🚀 |
Great work! very excited to start using this :) |
Agree! Very exciting @andrewkchan, thanks for the PR. I can't wait to run this on my M1 🙏 I will test/review this sometime today or tomorrow. Quick question, I noticed you made some changes in RasterizeGaussiansCPU and changed |
Currently getting this error while trying to run:
Also noticed that the definition for |
The error might be tied to your macOS version (13.3 vs 14.4). I was able to get through the Metal compile with the latest 14.4.1. Maybe we should consider tweaking CMake like ggerganov/llama.cpp#6370 did. Also, there seems to be a memory leak on Metal, and I'm trying to resolve it now. |
Yep, I'm on 13.2. I'll try a few things, see if I can get it to compile. Also, saw that we use |
Hmm, I don't remember changing this code. Looks like it was from the experimental commit that I based my changes on 472a45a
Yeah, I had ported over the ND rasterize function instead of the rasterize by accident but then decided to just use that. It's possible this was causing the slight numerical differences in unit tests. Happy to port over the rasterize_forward_kernel if needed.
Curious what problem you are running into, since as noted in the OP I'm intentionally leaking some resources forever. |
No need, but could certainly be done as part of another PR. I'm still trying to compile this on 13.2; I've isolated the problem to a call to |
What is the expected memory usage for something like the banana example? It's not great if there is a leak, but I'm not able to find anything using the Xcode "Leaks" tool except for two allocations of 128-byte objects. I also thought that memory usage is generally expected to increase over training, because the number of gaussians increases with scheduled splits. |
We can use this table as a baseline #3 (comment) |
I ended up upgrading to 14.4 and it now runs 🥳 I think there might be something off with the rasterize forward pass however, this is the result of the metal renderer after 100 iters on the banana dataset (you can do even just 10 iters):
Compared to the CPU run:
Looks like a width/height mismatch. I had these issues when writing the CPU rasterizer, I recommend using the |
Nice catch! You are totally right. Fixed and the loss is much lower after 100 iters now - |
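To illustrate the kind of width/height mismatch discussed above, here is a minimal standalone sketch. It is not the PR's code: `pixelIndex` and `fillRows` are hypothetical helpers, and a row-major image layout is assumed. It shows how using the height as the row stride on a non-square image makes rows overlap and leaves part of the buffer unwritten.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: offset of pixel (x, y) in a row-major buffer
// whose consecutive rows are `stride` elements apart.
inline std::size_t pixelIndex(std::size_t x, std::size_t y, std::size_t stride) {
    return y * stride + x;
}

// Hypothetical helper: write each pixel's row number into a
// width x height buffer using the given row stride.
// Untouched entries keep the sentinel value -1.
std::vector<int> fillRows(std::size_t width, std::size_t height, std::size_t stride) {
    std::vector<int> img(width * height, -1);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t i = pixelIndex(x, y, stride);
            if (i < img.size()) {  // guard against out-of-bounds writes
                img[i] = static_cast<int>(y);
            }
        }
    }
    return img;
}
```

With `width = 4` and `height = 2`, `fillRows(4, 2, 4)` fills both rows cleanly, while `fillRows(4, 2, 2)` (stride mistakenly set to the height) lets the second row overwrite part of the first and never touches the last two entries. A rendered image with this bug typically looks sheared or striped rather than merely noisy.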
I think the combined memory usage we see from the MPS backend is likely within the expected range. With emptyCache(), I was able to achieve a slightly lower memory footprint (7.8GB vs 8.7GB). The latest PyTorch doesn't provide a corresponding C++ API (such as c10::mps::MPSCachingAllocator::emptyCache()) to explicitly release the MPS cache. The current workaround (at::detail::getMPSHooks().emptyCache()) was discovered from this Python API, and I'm still uncertain whether it's truly effective. |
Nice! This is looking pretty amazing and can be merged. Thanks for everyone's help. 👍 Memory improvements, as well as a possible port of the |
This PR adds support for GPU acceleration via the MPS backend on macOS per #60.
gsplat
PyTorch ops with fused kernels for gaussian projection, rasterization, etc. to Metal Performance Shaders. Here's the speedup on my M3 Pro with macOS Sonoma 14.3. Wall clock goes from 5 minutes to 5 seconds!
GPU
CPU
Some implementation notes:
For future reference: generating an Xcode project from the CMakeLists.txt was very useful for debugging this, since Xcode provides some nice GPU tools.